Motivation—Tim

This year, a report from Centers for Disease Control and Prevention revealed that America’s obesity rate has reached a record high. Contrasting to popular belief, New Yorkers are not so safe from the obesity epidemic, as more than half of adult New Yorkers are either overweight or obese. Studies show that the rise in the obesity epidemic is partly due to disparities in food environment; it is harder for some to eat healthier because their options are limited. This project intends to look deeper into the relationship between food environment in NYC and obesity rate along with diabetes rate and stroke hospitalization rate.

Initial questions—Tim

1. Can the percentage of certain fast-food restaurant be related to obesity, diabetes and stroke rate in different neighborhoods of NYC?

2. Can the percentage of health inspection grade A restaurant be related to obesity, diabetes and stroke rate in different neighborhoods of NYC?

3. Can the composite restaurant health score of each neighborhood be related to chronic disease rate?

For the first two questions, we stick to it. For the third question, we change the main predictor to percentage of fast food cuisine type since it was easy to acquire in the data. Composite restaurant health score were to subjective and difficult to assess. Moreover, we calculated the boro level model first due to the difficulties we encountered in scaping zipcode-neighborhood data. However, after some struggle, we still managed to get the neighborhood information and analyze the data accordingly.

Data and Methods

Data Source and Collection—Tim

1. NYC restaurant inspection:

https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j

2. 2015 Community Health Profiles Open Data:

https://www1.nyc.gov/site/doh/data/data-publications/profiles.page

These data were rather easy to acquire, since they can be downloaded by CSV format. Because health and demographic data are grouped by neighborhood but the restaurant inspection data was using zipcodes. We had to match the zipcode and neighborhood name in order to merge the datasets which was the hardest.

First, we scraped the table data from https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm. However the neighborhood name and combinations were different from the health profile data we downloaded. We cannot merge the two datasets. Then we tried to look for the raw classification for the neigborhoods in health data from New York University’s Furman Center for Real Estate and Urban Policy and the NYC Department of City Planning. However, still no luck becuase the neighborhood was not divided by zipcode areas. Then we tried the hardest way: search the name of the neighborhood on Google and finding the matching zipcodes. It worked out well, but it was not precise, thus we left out some of the restaurants without a matching neighborhood name.

Data Import and Cleaning—Tim

Downloading restaurant inspection data from NYC open data:

get_all_inspections = function(url) {
  
  all_inspections = vector("list", length = 0)
  
  loop_index = 1
  chunk_size = 50000
  DO_NEXT = TRUE
  
  while (DO_NEXT) {
    message("Getting data, page ", loop_index)
    
    all_inspections[[loop_index]] = 
      GET(url,
          query = list(`$order` = "zipcode",
                       `$limit` = chunk_size,
                       `$offset` = as.integer((loop_index - 1) * chunk_size)
                       )
          ) %>%
      content("text") %>%
      fromJSON() %>%
      as_tibble()
    
    DO_NEXT = dim(all_inspections[[loop_index]])[1] == chunk_size
    loop_index = loop_index + 1
  }
  
  all_inspections
  
}

url = "https://data.cityofnewyork.us/resource/9w7m-hzhe.json"

rest_inspection = get_all_inspections(url) %>%
  bind_rows() 

Downloading health and demographic data CSV into local folder and reading, cleaning it:

download.file("https://www1.nyc.gov/assets/doh/downloads/excel/episrv/2015_CHP_PUD.xlsx", mode="wb", destfile = "health.xlsx")
health <- read_excel("health.xlsx", sheet = "CHP_all_data") %>% 
  select(Name, Racewhite_Rate, Age0to17_rate, 
         Age18to24_rate, Age25to44_rate, Poverty, Unemployment,
         Smoking, Exercise,
         Obesity, Diabetes, Stroke_Hosp) %>% 
  clean_names() %>% 
  mutate(age0to44_rate = age0to17_rate + age18to24_rate + age25to44_rate) %>% 
  select(-age0to17_rate, -age18to24_rate, -age25to44_rate)

Matching the neighborhood names with restaurant zipcodes:

zip_neighbor <- read_csv("neigh_zipcode.csv") %>% 
  mutate(zipcode = as.character(zipcode))
##restaurant data with neighbourhood
rest_neighborhood = left_join(rest_inspection, zip_neighbor, by = "zipcode") %>% 
  filter(!is.na(neighborhood))

Exploratory Analysis

Health—Tim&Austin

health_only_neighborhood <- health[-c(1:6),] %>% 
  rename(neighborhood = name) %>% 
  mutate(neighborhood = as.factor(neighborhood))
##Plotting for outcome in different neighborhood


health_only_neighborhood %>%
  mutate(neighborhood = fct_reorder(neighborhood, obesity)) %>% 
  filter(obesity > 25) %>% 
  ggplot(aes(x = neighborhood,y = obesity, fill = neighborhood)) + geom_col() +  
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = "none") +
  labs(title = "Neighborhood with 25 percent more obesity rate")

health_only_neighborhood %>%
  mutate(neighborhood = fct_reorder(neighborhood, diabetes)) %>% 
  filter(diabetes > 10) %>% 
  ggplot(aes(x = neighborhood,y = diabetes, fill = neighborhood)) + geom_col() +   
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = "none") +
  labs(title = "Neighborhood with 10 percent more diabetes rate")

health_only_neighborhood %>%
  mutate(neighborhood = fct_reorder(neighborhood, stroke_hosp)) %>% 
  filter(stroke_hosp > 300) %>% 
  ggplot(aes(x = neighborhood,y = stroke_hosp, fill = neighborhood)) + 
  geom_col() +    
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = "none") +
  labs(title = "Neighborhood with 300 more stroke hospitalization in 100,000 adults")

##Plotting for some adjusted variables


health_only_neighborhood %>%
  mutate(neighborhood = fct_reorder(neighborhood, poverty)) %>% 
  filter(poverty > 20) %>% 
  ggplot(aes(x = neighborhood,y = poverty, fill = neighborhood)) + geom_col() +  
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none") +
  labs(title = "Neighborhood with 20 percent or more poverty")

health_only_neighborhood %>%
  mutate(neighborhood = fct_reorder(neighborhood, age0to44_rate)) %>% 
  filter(age0to44_rate > 60) %>% 
  ggplot(aes(x = neighborhood,y = age0to44_rate, fill = neighborhood)) + 
  geom_col() +    
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none") +
  labs(title = "Neighborhood with 60 percent or more people under age 44")

Summary: Williamsbridge and Baychester have the highest obesity rate, up to 32%. East New York and Starrett City have the highest diabetes rate, up to 17 percent. BUshwick has the highest stroke hospitalization in 100,000 adults, up to 300 adults. Morrisania and Crotona has the highest percent of poverty rate, up to 40%. Fiancial district has the highest percent of young peoeple, up to 67%.

Restaurant data

Fastfood restaurants by cuisine type—Ruiwei

# Identify fastfood restaurants by their cuisine types. We print out the cuisine type list (n=85) and let everyone circle the ones they think is fastfood and we use the union as our rule (use union because it's more conservative).

rest_inspection %>%
  mutate(dba = str_to_upper(dba)) %>%
  arrange(cuisine_description) %>%
  count(cuisine_description)
# calculating the total number of restaurants and the number of fastfood restaurants in the neighborhood, as well as the percentage of fastfood restaurants.

neighborhood_list = 
  rest_neighborhood %>%
  distinct(neighborhood) %>%
  arrange(neighborhood)
  
rest_fastfood_neighborhood = 
  rest_neighborhood %>%
  filter(cuisine_description %in% c("Bagels/Pretzels",
                                    "Bottled beverages, including water, sodas, juices, etc.",
                                    "Chicken",
                                    "Delicatessen",
                                    "Donuts",
                                    "Hamburgers",
                                    "Hotdogs",
                                    "Hotdogs/Pretzels",
                                    "Ice Cream, Gelato, Yogurt, Ices",
                                    "Nuts/Confectionary",
                                    "Pancakes/Waffles",
                                    "Pizza",
                                    "Soul Food",
                                    "Sandwiches",
                                    "Sandwiches/Salads/Mixed Buffet",
                                    "Soups & Sandwiches"))

percent_fastfood_neighborhood = function(name_neighborhood){

  rest_each_neighborhood =
    rest_neighborhood %>%
    filter(neighborhood == name_neighborhood) %>%
    distinct(camis)
  
  n_rest_neighborhood = nrow(rest_each_neighborhood)

  rest_fastfood_distinct_neighborhood = 
    rest_fastfood_neighborhood %>%
    filter(neighborhood == name_neighborhood) %>%
    distinct(camis, cuisine_description)
  
  n_fastfood_neighborhood = nrow(rest_fastfood_distinct_neighborhood)
    
  percent_fastfood_neighborhood = n_fastfood_neighborhood/n_rest_neighborhood
  
  tibble(
    neighborhood = name_neighborhood,
    n_fastfood = n_fastfood_neighborhood,
    n_rest = n_rest_neighborhood,
    percent_fastfood = percent_fastfood_neighborhood
  )
}

fastfood_neighborhood = 
  map(neighborhood_list$neighborhood, percent_fastfood_neighborhood) %>%
  bind_rows() %>%
  mutate(neighborhood = str_to_upper(neighborhood))
fastfood_neighborhood %>%
  mutate(neighborhood = as.factor(neighborhood),
         n_rest = as.numeric(n_rest),
         neighborhood = fct_reorder(neighborhood, percent_fastfood)) %>%
  plot_ly(., x = ~neighborhood, y = ~n_rest, type = 'bar', name = 'total restaurants') %>%
  add_trace(y = ~n_fastfood, name = 'fastfood restaurants') %>%
  layout(yaxis = list(title = 'number of restaurants'), 
         xaxis = list(title = 'neighborhood (ordered by percentage of fastfood restaurants)',
                      tickangle = 45),
         barmode = 'stack')

From the plot, we can see that, while the Greenwich Village and SOHO neighborhood has fairly large number of restaurants, it has the smallest percentage of fastfood restaurants. Williamsbridge and Baychester has the largest percentage of fastfood restaurants.

When large number of total restaurants is not equal to large percentage of fastfood restaurants in that neighborhood, we can conclude that the distribution of fastfood restaurants is not even across neighborhoods, which also implies the motivation of our study, we want to investigate if this uneven distribution of fastfood restaurants is associated with diffferent level of chronic disease outcomes within a neighborhood.

Restaurant Chains—Sara

## Matching list of chain restaurants in U.S. and restaurant inspection data

chains_html = read_html("https://en.wikipedia.org/wiki/List_of_restaurant_chains_in_the_United_States#Fast-casual")

chain_rest = chains_html %>%
  html_nodes("ul:nth-child(9) li , .jquery-tablesorter tr:nth-child(1) td:nth-child(1)") %>%
  html_text() %>%
  as.tibble() %>%
  mutate(dba = value, 
         dba = str_to_upper(dba)) %>%
  select(dba)
# read in the list of chain restaurants in us 
# made the names to uppercase and changed the var name to dba
head(chain_rest, 10)
## # A tibble: 10 x 1
##                       dba
##                     <chr>
##  1        A&W RESTAURANTS
##  2             APPLEBEE'S
##  3             BAJA FRESH
##  4          BOSTON MARKET
##  5     BUFFALO WILD WINGS
##  6                CHILI'S
##  7 CHIPOTLE MEXICAN GRILL
##  8           CICI'S PIZZA
##  9    COLD STONE CREAMERY
## 10     CORNER BAKERY CAFE

I have scraped list of 75 national chain restaurants from the wikipedia page (https://en.wikipedia.org/wiki/List_of_restaurant_chains_in_the_United_States#Fast-casual) now I will join this dataset with restaurant inspection data to choose only the chain restaurants in NYC.

# removing punctuations in chain_rest & restaurant inspections
chain_rest_str = 
  chain_rest  %>%
  mutate(dba = str_replace_all(dba, "[[:punct:]]", ""))

rest_inspect_str = rest_inspection %>% 
  mutate(dba = str_replace_all(dba, "[[:punct:]]", ""))

# Matching the two datasets(restaurant inspection data that has all punctuation removed from dba(restaurant name) and list of chain restaurants by dba)
nyc_chain = 
  right_join(rest_inspect_str, chain_rest_str) %>%
  filter(!is.na(camis)) %>%
  distinct(camis, dba, boro)
## Joining, by = "dba"
nyc_chain %>%
  group_by(dba) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  head(10)
## # A tibble: 10 x 2
##                       dba     n
##                     <chr> <int>
##  1          DUNKIN DONUTS   453
##  2                 SUBWAY   344
##  3              STARBUCKS   282
##  4              MCDONALDS   207
##  5 CHIPOTLE MEXICAN GRILL    66
##  6                 WENDYS    42
##  7              APPLEBEES    28
##  8           WHITE CASTLE    25
##  9          BOSTON MARKET    22
## 10                   IHOP    21

The combined dataframe, nyc_chain has nrow(nyc_chain) observations. Also, there were 33 different chain restaurants extracted. The chart is showing the 10 chain restaurants in descending order. Next step is calculating the percentage of chain restaurants in each boro.

# counting chains in boro

count_chain = nyc_chain %>%
  group_by(boro) %>%
  summarise(chain_n = n())

count_rest = rest_inspection %>%
  distinct(boro, camis) %>%
  group_by(boro) %>%
  summarise(res_n = n())

# calculating percentage of chains in each boro
percent_chain = left_join(count_chain, count_rest) %>%
  mutate(chain_percentage = chain_n/res_n)
## Joining, by = "boro"
# let's see this in a graph
percent_chain %>%
  mutate(boro = forcats::fct_reorder(boro, chain_percentage)) %>%
   ggplot(aes(boro, chain_percentage, fill = boro)) + geom_bar(stat="identity")

We can see that Brooklyn has the lowest percentage of chain restauratns with percentage less that 5%. Next was Manhattan with 6% and the highest was Staten Island with 10% of chain restaurants.

Inspection Grade—Austin

data1 =
  rest_neighborhood %>%
  group_by(neighborhood, grade) %>%
  summarise(n = n()) %>%
  mutate(grade_percent = n / sum(n)) %>% 
  filter(grade == "A") %>% 
  ungroup(boro)
data1 %>% 
  mutate(neighborhood = as.factor(neighborhood),
         neighborhood = fct_reorder(neighborhood,grade_percent)) %>% 
  plot_ly(x = ~neighborhood, y = ~grade_percent, color = ~neighborhood, type = "bar")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

Different from the situation in borough, the difference of percentage of “grade-A” restaurants in each borough exists. “Throgs Neck and Co-op City” has the greatest grade-A restaurants percentage, around 44.50%. “Sunset Park”, however, has the least, around 30.62%. “Washington Heights”, where we live, takes the fourth from the end, 34.39%, which is obviously consistent with the feeling we have towards the restaurant condition of “Washington Heights”.

Formal Analysis and Findings

Fastfood restaurants by cuisine type—Ruiwei

# combine the dataset with health data and run linear simple regression models

health_neighborhood = 
  health %>%
  mutate(neighborhood = str_to_upper(name)) %>%
  select(-name)

combined_neighborhood = 
  left_join(fastfood_neighborhood, health_neighborhood, by = "neighborhood")

# fit multiple linear regression model between the three health outcomes and percentage of fastfood restaurants after adjusting fro race, age, poverty, smoking status and exercise status

lm(obesity ~ percent_fastfood + racewhite_rate + age0to44_rate + poverty + smoking + exercise, combined_neighborhood) %>%
  broom::tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) 25.7821960 13.6538571 1.8882720 0.0645740
percent_fastfood 48.7745006 13.8250910 3.5279696 0.0008850
racewhite_rate -0.1074704 0.0405247 -2.6519723 0.0105840
age0to44_rate -0.0871726 0.1529002 -0.5701272 0.5710466
poverty 0.1154709 0.1169470 0.9873783 0.3280294
smoking 0.7562470 0.2477709 3.0522026 0.0035739
exercise -0.2082229 0.1561606 -1.3333889 0.1882161
lm(diabetes ~ percent_fastfood + racewhite_rate + age0to44_rate + poverty + smoking + exercise, combined_neighborhood) %>%
  broom::tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) 17.5350180 5.7068880 3.0726060 0.0033738
percent_fastfood 17.5639085 5.7784584 3.0395492 0.0037036
racewhite_rate -0.0778512 0.0169381 -4.5962285 0.0000278
age0to44_rate 0.0056160 0.0639075 0.0878768 0.9303121
poverty 0.0443508 0.0488802 0.9073366 0.3684145
smoking 0.1527703 0.1035605 1.4751785 0.1461941
exercise -0.1465556 0.0652703 -2.2453643 0.0290222
lm(stroke_hosp ~ percent_fastfood + racewhite_rate + age0to44_rate + poverty + smoking + exercise, combined_neighborhood) %>%
  broom::tidy() %>%
  kable()
term estimate std.error statistic p.value
(Intercept) -0.6792658 143.2162470 -0.0047429 0.9962338
percent_fastfood 439.1079799 145.0123308 3.0280734 0.0038249
racewhite_rate -1.5938059 0.4250665 -3.7495447 0.0004464
age0to44_rate 0.2610684 1.6037811 0.1627830 0.8713202
poverty 1.7935986 1.2266651 1.4621746 0.1497111
smoking 6.4527974 2.5988861 2.4829089 0.0162954
exercise 1.6711661 1.6379797 1.0202605 0.3123283

According to the results of multiple linear regressions, we can conclude that after adjusting for race, age, poverty, smoking status and exercise status, there is significant relationships between the three chronic disease outcomes and the percentage of fastfood restuarants within boroughs. (Obesity: p-value=0.000909491; Diabetes: p-value=3.756556e-03; Hospitalized Stroke: p-value=0.0039134849)

Restaurant Chains—Sara

# removing punctuations in restaurant inspections
rest_neigh_str = rest_neighborhood %>% 
  mutate(dba = str_replace_all(dba, "[[:punct:]]", ""))

# Matching the two datasets(neighborhood restaurant inspection data that has all punctuation removed from dba(restaurant name) and list of chain restaurants by dba)
neighborhood_chain = 
  right_join(rest_neigh_str, chain_rest_str) %>%
  filter(!is.na(camis)) %>%
  distinct(camis, dba, neighborhood)
## Joining, by = "dba"
# counting chains in neighborhoods

neigh_count_chain = neighborhood_chain %>%
  group_by(neighborhood) %>%
  summarise(chain_n = n())

neigh_count_rest = rest_neighborhood %>%
  distinct(neighborhood, camis) %>%
  group_by(neighborhood) %>%
  summarise(res_n = n())

# calculating percentage of chains in each boro
percent_neighborhood_chain = left_join(neigh_count_chain, neigh_count_rest) %>%
  mutate(chain_percentage = chain_n/res_n)
## Joining, by = "neighborhood"
# joinining with health data
health_chain_neighborhood = left_join(percent_neighborhood_chain, health_only_neighborhood) 
## Joining, by = "neighborhood"
## Warning: Column `neighborhood` joining character vector and factor,
## coercing into character vector
# function for modeling health outcome and percentage of chain restaurants

lm_neighborhood = function(health_outcome){
  
 graphs_chain_health = health_chain_neighborhood %>%
  mutate(boro = forcats::fct_reorder(neighborhood, health_outcome)) %>%
  ggplot(aes(chain_percentage, health_outcome, color = neighborhood)) + geom_point() + theme(legend.position="none")
 
 reg_chain = lm(health_outcome ~ chain_percentage + racewhite_rate + age0to44_rate + poverty + smoking + exercise, data = health_chain_neighborhood) %>%
   summary()

list(graphs_chain_health, reg_chain)
# function that takes a input of column in health_chain_neighborhood and gives output of:
 
# 1. graphs_health will give a graph showing the relationship between health outcomes in neighborhoods and their percentage of chain restaurants

# 2.reg_chain gives an output of the linear regression result
}
lm_neighborhood(health_chain_neighborhood$obesity)
## [[1]]

## 
## [[2]]
## 
## Call:
## lm(formula = health_outcome ~ chain_percentage + racewhite_rate + 
##     age0to44_rate + poverty + smoking + exercise, data = health_chain_neighborhood)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.7420  -2.9724  -0.8885   3.7028   8.7493 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      27.43030   14.43133   1.901  0.06288 . 
## chain_percentage 62.89506   23.41510   2.686  0.00968 **
## racewhite_rate   -0.13297    0.04047  -3.286  0.00182 **
## age0to44_rate    -0.08772    0.16061  -0.546  0.58729   
## poverty           0.17525    0.12581   1.393  0.16956   
## smoking           0.82845    0.25612   3.235  0.00212 **
## exercise         -0.19105    0.16346  -1.169  0.24783   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.925 on 52 degrees of freedom
## Multiple R-squared:  0.6608, Adjusted R-squared:  0.6217 
## F-statistic: 16.88 on 6 and 52 DF,  p-value: 1.062e-10
# I tried log transformation and it's even better!

#health_chain_neighborhood %>%
#  mutate(boro = forcats::fct_reorder(neighborhood, obesity)) %>%
#  ggplot(aes(chain_percentage, obesity^2, color = neighborhood)) + geom_point() + #theme(legend.position="none")
#
#lm(obesity^2 ~ chain_percentage + racewhite_rate + age0to44_rate + poverty + smoking + #exercise, data = health_chain_neighborhood) %>%
#   summary()

Fitting a regression line with predictors chain_percentage, racewhite_rate, age0to44_rate, poverty,smoking, and exercise, we can see that the percentage of chains is a significant predictor with p-value, 0.01057. Holding all the other predictors constant, 62.2% increase in percentage of obesity in NYC neighborhoods is associated with 1% increase in chain restaurants. From the graph of percentage of chain restaurants and percentage of obesity, we can see there is a positive trend.

lm_neighborhood(health_chain_neighborhood$diabetes)
## [[1]]

## 
## [[2]]
## 
## Call:
## lm(formula = health_outcome ~ chain_percentage + racewhite_rate + 
##     age0to44_rate + poverty + smoking + exercise, data = health_chain_neighborhood)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0861 -1.1391 -0.1949  1.3621  4.4662 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      18.193485   5.976656   3.044  0.00366 ** 
## chain_percentage 22.349494   9.697233   2.305  0.02521 *  
## racewhite_rate   -0.087199   0.016759  -5.203 3.37e-06 ***
## age0to44_rate     0.005019   0.066517   0.075  0.94015    
## poverty           0.065475   0.052105   1.257  0.21452    
## smoking           0.179443   0.106069   1.692  0.09668 .  
## exercise         -0.140563   0.067697  -2.076  0.04282 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.04 on 52 degrees of freedom
## Multiple R-squared:  0.7572, Adjusted R-squared:  0.7292 
## F-statistic: 27.03 on 6 and 52 DF,  p-value: 2.31e-14

We can also visually see there is positive correlation between diabetes and percentage of chains. From the linear regression analysis, the results show that percentage of chain restaurants in neighborhoods is a significant predictor of percentage of diabetics in NYC neighborhoods

lm_neighborhood(health_chain_neighborhood$stroke_hosp)
## [[1]]

## 
## [[2]]
## 
## Call:
## lm(formula = health_outcome ~ chain_percentage + racewhite_rate + 
##     age0to44_rate + poverty + smoking + exercise, data = health_chain_neighborhood)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -134.604  -30.567    2.136   25.483  141.942 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       81.6054   155.8476   0.524  0.60277    
## chain_percentage 255.5637   252.8656   1.011  0.31685    
## racewhite_rate    -1.9933     0.4370  -4.561 3.13e-05 ***
## age0to44_rate     -0.1588     1.7345  -0.092  0.92739    
## poverty            1.9123     1.3587   1.407  0.16524    
## smoking            7.8010     2.7659   2.820  0.00677 ** 
## exercise           1.6262     1.7653   0.921  0.36119    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.19 on 52 degrees of freedom
## Multiple R-squared:  0.6404, Adjusted R-squared:  0.5989 
## F-statistic: 15.43 on 6 and 52 DF,  p-value: 4.58e-10

When fitting chain_percentage, racewhite_rate, age0to44_rate, poverty,smoking, and exercise on rate of stroke however, the outcomes indicates that chain_percentage is not a significant predictor at significance level 0.05.

Inspection Grade—Austin

data2 =
  data1 %>% 
  mutate(neighborhood = str_to_upper(neighborhood))

combined1 = 
  left_join(data2, health_neighborhood, by = "neighborhood")

First, combine the data which shows the health data of each neighborhood with the dataset which shows the percentage of “grade-A” restaurants in each neighborhood. Then we acquire the final dataset, we can do the linear regression between obesity, diabetes and stroke with the percentage of “grade-A” restaurants, which we treat as a represent of the healthy restaurants in that neighborhood.

reg4 <- lm(obesity ~ grade_percent + racewhite_rate + poverty + unemployment + smoking + exercise + diabetes + stroke_hosp + age0to44_rate, data = combined1 )
summary(reg4)
## 
## Call:
## lm(formula = obesity ~ grade_percent + racewhite_rate + poverty + 
##     unemployment + smoking + exercise + diabetes + stroke_hosp + 
##     age0to44_rate, data = combined1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1700  -2.0304  -0.3021   2.0637   8.0918 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.51605   12.87776   0.739  0.46346    
## grade_percent  -17.55024   20.64471  -0.850  0.39940    
## racewhite_rate   0.06956    0.04334   1.605  0.11494    
## poverty         -0.12597    0.11180  -1.127  0.26533    
## unemployment     0.54499    0.30489   1.787  0.08004 .  
## smoking          0.38736    0.20439   1.895  0.06398 .  
## exercise        -0.11180    0.12819  -0.872  0.38738    
## diabetes         1.25042    0.24965   5.009 7.51e-06 ***
## stroke_hosp      0.02980    0.01091   2.730  0.00876 ** 
## age0to44_rate   -0.06777    0.12009  -0.564  0.57514    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.616 on 49 degrees of freedom
## Multiple R-squared:  0.8277, Adjusted R-squared:  0.796 
## F-statistic: 26.15 on 9 and 49 DF,  p-value: 8.825e-16
broom::tidy(reg4)
##              term     estimate   std.error  statistic      p.value
## 1     (Intercept)   9.51605092 12.87775730  0.7389525 4.634610e-01
## 2   grade_percent -17.55023512 20.64471083 -0.8501081 3.993999e-01
## 3  racewhite_rate   0.06956019  0.04334226  1.6049045 1.149415e-01
## 4         poverty  -0.12597457  0.11180228 -1.1267621 2.653310e-01
## 5    unemployment   0.54498834  0.30489039  1.7874894 8.004392e-02
## 6         smoking   0.38735698  0.20439387  1.8951496 6.398006e-02
## 7        exercise  -0.11180414  0.12819321 -0.8721534 3.873783e-01
## 8        diabetes   1.25041638  0.24965424  5.0085927 7.511381e-06
## 9     stroke_hosp   0.02980073  0.01091430  2.7304293 8.764716e-03
## 10  age0to44_rate  -0.06776658  0.12009361 -0.5642813 5.751367e-01

p-value is smaller than 0.05, reject the null, meaning that there is a linear association between the obesity and the percentage of grade A restaurant in each neighborhood.

reg5 <- lm(diabetes ~ grade_percent + racewhite_rate + poverty + unemployment + smoking + exercise + obesity + stroke_hosp + age0to44_rate, data = combined1 )
summary(reg5)
## 
## Call:
## lm(formula = diabetes ~ grade_percent + racewhite_rate + poverty + 
##     unemployment + smoking + exercise + obesity + stroke_hosp + 
##     age0to44_rate, data = combined1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4308 -1.0375 -0.0579  0.7548  4.3168 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.389110   5.932974   1.245 0.218896    
## grade_percent  15.613823   9.417344   1.658 0.103710    
## racewhite_rate -0.071473   0.017999  -3.971 0.000234 ***
## poverty         0.057859   0.052046   1.112 0.271702    
## unemployment   -0.147801   0.144908  -1.020 0.312755    
## smoking        -0.029270   0.098453  -0.297 0.767497    
## exercise       -0.083995   0.058908  -1.426 0.160248    
## obesity         0.270795   0.054066   5.009 7.51e-06 ***
## stroke_hosp    -0.001295   0.005449  -0.238 0.813060    
## age0to44_rate   0.011646   0.056044   0.208 0.836247    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.683 on 49 degrees of freedom
## Multiple R-squared:  0.8442, Adjusted R-squared:  0.8156 
## F-statistic: 29.51 on 9 and 49 DF,  p-value: < 2.2e-16
broom::tidy(reg5)
##              term    estimate   std.error  statistic      p.value
## 1     (Intercept)  7.38910959 5.932973952  1.2454310 2.188962e-01
## 2   grade_percent 15.61382328 9.417343967  1.6579859 1.037096e-01
## 3  racewhite_rate -0.07147265 0.017998881 -3.9709496 2.341093e-04
## 4         poverty  0.05785864 0.052046265  1.1116770 2.717019e-01
## 5    unemployment -0.14780087 0.144907738 -1.0199654 3.127545e-01
## 6         smoking -0.02926982 0.098453156 -0.2972969 7.674966e-01
## 7        exercise -0.08399495 0.058908112 -1.4258638 1.602476e-01
## 8         obesity  0.27079499 0.054066083  5.0085927 7.511381e-06
## 9     stroke_hosp -0.00129547 0.005448697 -0.2377578 8.130603e-01
## 10  age0to44_rate  0.01164590 0.056043871  0.2077997 8.362466e-01

p-value smaller than 0.05, reject the null, meaning that there is a linear association between the diabetes and the percentage of grade A restaurant in each neighborhood.

reg6 <- lm(stroke_hosp ~ grade_percent + racewhite_rate + poverty + unemployment + smoking + exercise + diabetes + obesity + age0to44_rate, data = combined1 )
summary(reg6)
## 
## Call:
## lm(formula = stroke_hosp ~ grade_percent + racewhite_rate + poverty + 
##     unemployment + smoking + exercise + diabetes + obesity + 
##     age0to44_rate, data = combined1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.313 -25.002   2.209  22.423 129.730 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     -10.7866   157.8985  -0.068  0.94581   
## grade_percent  -260.9177   250.8403  -1.040  0.30336   
## racewhite_rate   -0.7002     0.5329  -1.314  0.19503   
## poverty          -0.7829     1.3763  -0.569  0.57209   
## unemployment      8.6891     3.6309   2.393  0.02058 * 
## smoking           3.5201     2.5327   1.390  0.17086   
## exercise          1.8603     1.5527   1.198  0.23665   
## diabetes         -0.8895     3.7412  -0.238  0.81306   
## obesity           4.4313     1.6229   2.730  0.00876 **
## age0to44_rate     1.1746     1.4596   0.805  0.42485   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.1 on 49 degrees of freedom
## Multiple R-squared:  0.767,  Adjusted R-squared:  0.7242 
## F-statistic: 17.92 on 9 and 49 DF,  p-value: 1.106e-12
broom::tidy(reg6)
##              term     estimate   std.error  statistic     p.value
## 1     (Intercept)  -10.7866300 157.8984788 -0.0683137 0.945813925
## 2   grade_percent -260.9177119 250.8403054 -1.0401746 0.303364248
## 3  racewhite_rate   -0.7001605   0.5329306 -1.3137932 0.195033842
## 4         poverty   -0.7828686   1.3763477 -0.5688014 0.572088891
## 5    unemployment    8.6891360   3.6308590  2.3931351 0.020581132
## 6         smoking    3.5200670   2.5327013  1.3898469 0.170857798
## 7        exercise    1.8602912   1.5527164  1.1980882 0.236645761
## 8        diabetes   -0.8894987   3.7411965 -0.2377578 0.813060320
## 9         obesity    4.4312954   1.6229299  2.7304293 0.008764716
## 10  age0to44_rate    1.1745939   1.4595775  0.8047493 0.424851956

Surprise! We get an insignificant variable! P-value is higher than 0.05, suggesting there is no linear association between stroke and the percentage of grade A restaurant in each neighborhood.

Conclusion—to do later

write

Discussion—to do later

write